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ABSTRACT 


Efficient parallel algorithms for SIMD (Single 
Instruction strean = Multiple Data stream) parallel 
computers are considered for a variety of problems. Included 
are efficient parallel algorithms for performing integer 
acithmetic, polynomial arithmetic and modular arithmetic. 
These parallel algorithms lead to efficient or 
(asymptotically) optimal methods for solving the following 
problems: 

1) parallel integer arithmetic: 

addition, subtraction, multiplication and division 
of multiple-precision integers. 

2) parallel polynomial arithmetic: 

addition, subtraction, multiplication, division and 
evaluation of polynomials, and evaluation of 
elementary symmetric functions. 

3) parallel modular arithmetic: 

residue evaluation of multiple-precision integers 
and of polynomials, the Chinese remainder problen, 
polynomial interpolation, and the solution of full 
linear systems of equations by modular methods. 

Among the parallel algorithms given in this thesis, 
three broad classes can be identified according to their 
level of parallelism: 

1) parallel algorithms assuming linearly bounded 
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2) 


3) 


integers, addition and subtraction of polynomials. 
parallel algorithms assuming polynomially bounded 
parallelisn, Ce Gey multiplication of integers, 
multiplication of polynomials, residue evaluation of 
polynomials, polynomial interpolation, and the 
evaluation of elementary symmetric functions. 
parallel aigorithms assuming exponentially bounded 
parallelism, e.g., division of integers, division of 
polynomials, residue evaluation of multiple- 
precision integers, the Chinese remainder problem, 
and the solution of full linear systems of 


equations. 
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CHAPTER ONE 


The central theme of this thesis is optimal parallel 
computations for SIMD (Single Instruction stream - Multiple 
Data stream) parallel computers (Cf. Section 1.1). Many 
people think that parallel computations are extensions of 
sequential computations. This view is quite wrong. Parallel 
computations give rise to unique problems and issues beyond 
those of sequential computations [Stone 1973, Newell and 
Robertson 1975]. In fact, some mistakes of earlier research 
(and including some recent research) in parallel 
computations can be directly contributed to this view. For 
example, it is not surprising that adapting the classical 
algorithms such as Gauss-Jordan elimilation, LDU 
factorization, or Givens reduction on parallel computers is 


far from optimal. 


In order to prepare discussion for subsequent chapters, 
this chapter is intended to clarify some important points 


and define our objectives. 


In Section 1.1 we classify parallel computers according 
to Flynn's stream concept. Two broad categories can be 
obtained: the SIMD parallel computers and the MIMD (Multiple 
Instruction stream = Multiple Data stream) parallel 
computers. We explain why SIMD parallel computers are 
favored in this thesis. Also, in this section, the 


characteristics of idealized SIMD parallel computers, which 
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serve as the basis of all parallel algorithms in the 


subsequent chapters, are described. 


In Section 1.2 models of parallel computation and 
measurements of parallel complexity are given. Since the aim 
of this study is to gain insight into parallel computational 
complexity, we concentrate on the optimality and asymptotic 


behavior of parallel algorithms. 


In Section 1.3 we briefly summarize the contributions 


of this thesis. 
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1.1 Classification of Parallel Computers 


Many computers now exist which are capable of executing 
more than one instruction simultaneously [{ Barnes et al 1968, 
Hintz 6& Tate 1972, Johnson 1972, Rudolph 1972, Watson 1972, 
Wulf & Bell 1972]. Although all these computers are grossly 
termed as parallel computers, there are significant 
functional and architectural differences between any two 
parallel computers. A number of different approaches to 
Classification of parallel computers are possible [Thurber & 
Wald 1975]. Most techniques uSe global architectural 
properties and are thus valid only within limited ranges. 
For the purpose of studying parallel computations, Flynn's 
stream concept [1966, 1972] will be adopted. Stream in this 
context Simply means a Sequence of instructions or data as 
executed or operated on by a processoc. The advantage of 
this approach is that it describes a parallel computer from 
a macroscopic point of view, and yet avoids the pitfalls of 


relating such descriptions to a particular problen. 


Using the stream concept, parallel computers are 
categorized by the magnitude (either in space or time 
multiplex) of interactions of their instruction and data 


streams. The two major categories are: 


1) The Single Instruction stream - Multiple Data stream 
(SIMD) parallel computers have only one stream of 
instructions in execution at any time, but each 


instruction may affect many different data. These 
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been out 


parallel computers can be further categorized as 

being 

a) parallel in space, aS are structured array 
computers (1) (ILLIAC IV, OMEN-60, SIMDA); 

b) unstructured linear array computers, or ensembles 
(2) (PEPE, Goodyear STARAN S); 

Cc) parallel primarily in time, as are pipeline 
computers (CDC STAR-100, TI ASC); 

d) associative computers whose memories are 
addressed by contents rather than by addresses 


(Goodyear STARAN S, PEPE, OMEN-60, SIMDA). 


2) The Multiple Instruction stream - Multiple Data 
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stream (MIMD) parallel computers have more than one 


stream of instructions, in fact, as many instruction 
streams as there are data streams. These parallel 
computers are essentially interconnected sequential 
computers, usually called multiprocessors (UNIVAC 


1108 multiprocessor system, C.mmp 63), IMP). 


The MIMD parallel computers that have been built so far 
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(1) A structured array computer has a high level of 
interconnectivity such as exhibited by ILLIAC IV's four 
nearest neighbor connections. 


C2) An ensemble such as PEPE tends not to have much 
interprocessor connectivity except by transmission to the 
control unit and retransmission to the appropriate 
processor. 


(3) Ccmmp consists of 16 PDP-11 computers connected through 
a crosspoint switch to 16 primary-memory ports. 
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tend to have only 2, 4, or in care instances 16 processors 
(C.mmp) , whereas the SIMD parallel computers have an 
effective parallelism as small as 64 (ILLIAC IV) to upwards 
of 288 (PEPE) and 1024 (STARAN S) 4). In addition, it is 
likely that the scale gap will become greater and greater 
due to the recent microprocessor / computer-on-a-chip 
revolution [Soucek 1976]. Unfortunately, most cesearch 
previously done in the area of parallel computations is 


strongly oriented towards HIMD computations. 


Since it is evident that technology for SIMD parallel 
computers is approaching maturity and a good number of large 
scale SIMD parallel computers are available, there is an 
urgent need for efficient parallel algorithms designed 
Specifically for SIMD parallel computers. Without efficient 
parallel algorithms these computer giants are just ordinary 
dwarfs, and hardware worth millions of dollars is just 
another engineering spectacle. Therefore, in this thesis we 
are solely interested in designing parallel algorithms for 
SIMD parallel computers. (5S) Proposed parallel algorithms 
are assumed to be executed on idealized SIMD parallel 


computers which can be characterized as follows: 


(4) STARAN has a potential parallelism as large as 8192 in 
the most advanced model. 


(5) Any parallel algorithm designed for SIMD parallel 
computers can easily be adapted for MIMD parallel computers, 
but not conversely. 
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1) 


2) 


3) 


4) 


5) 


There are an indefinite number (6) of identical 
processors, each able to execute the usual 
arithmetic, Boolean, relational (i.e., comparison), 
and control operations, and each with its own 
MCMOLy. 

All processors obtain their instructions 
Simultaneously from a single instruction stream 
broadcast by the control unit. Thus, all processors 
execute the same instruction, but operate on data 
stored in their own memories. 

Each processor has a distinct number by which it may 
be identified by the control unit or referenced by 
an instruction. 

Any processoc may be disabled from performing an 
instruction. The enabling or disabling of individual 
processors is effected according to the result of 
some local test. 

All processors can exchange data with each other 
over predefined data paths (ILLIAC IV) or through a 


data permutation network (STARAN S). 


However, this number is bounded [Kuck & Muraoka 1974, 
Kuck & Maruyama 1975, Chen & Kuck 1975]. 
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1.2 Complexity of Parallel Algorithms 


If algorithms for solving certain problems are to be 
analyzed and if lower bounds for solving these problems are 
to be obtained, models of computation and measurements of 
complexity must be first given. The following definition is 
evolved from a sequential computational model [Winograd 
1970]. A parallel model similar to this one is proposed by 


Munro and Paterson [1973]. 


Definition 1.2.1 Given any three sets A, B and R, a 


me ee eee eee ee eS me ce 


computation of R in A given B is a finite sequence of non- 
ft 


empty sets S,c A, 0 <i $ t, Such that Sipe Br Rou on 

i=0 ab 

and for each i> 0, if x « S. then x = (y op Zz), where y, z 
i-1 

Lo S, and op « O.- The set 0. contains the types of all 


Operations used for obtaining S |- The integer t is the 
Number of operations required for the computation. The 


elements in R are the results of the computation. 


Note that in the above definition all the intermediate 
results Sie 1< i < t, are contained in the set A. In fact, 
this is often a convenient way to express the types of 
operations which are allowed in a computation. For example, 
let F be an arbitrary algebraically complete field 7) and 
{x, | 1<i<n } be the input arguments. If arithmetic and 


only arithmetic operations are allowed for a computation, 


(7) Throughout this thesis F denotes an algebraically 
complete field. 
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then the set A can be expressed as the (symbolic) rational 
field P(x), X50 ecee X_)- On the other hand, if the set A is 
specified as the ring F[X1 Xoe sees Xe then this is 
implicitly saying that division is not allowed. In the more 
general case, however, all arithmetic, Boolean and 
relational operations are allowed (Cf. Section 1.1) and a 


definition of the set A is normally omitted. 


Example. Let w be a primitive n-th root of unity in F. 


Then the following identity holds: 
Bow / (x-w) = 0/7 (x -1)- 


Based on this identity, the following sequence of sets is a 


computation for evaluating x® in F(x) given F U {x} (Kung 


1974}. 

Sets Comments 

So =F U {x}, initially given. 

Bese tAaS= ae | 15158}, OFn= tater Ste” Shr is e Soe 
S5 = (B,<- wy A | 1sis8}, O0,= {/}- 

S867 <- 035127 EB pele tSiStbee) Ome siti- 

S, = {D4<- Co;_7+ Coy | 18452}, O,= {t}- 

S5 = {E <- D+ Do}e O,= {t+}. 

Se = {G <- 8 / E}, Or, =a} ake ta BS 

S7= (Gti}-. O saith. Wem, 


The number of operations required for this computation is 7. 


Also, note that all (and only) arithmetic operations are 
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used in the computation. Thus, the computation is indeed in 


Fi(x) 


computation if and only if {S.J = 1 for all i > 0; otherwise 


it is called a parallel computation. (9) 


Note that in the above example, since {S 


computation is a parallel computation. 


3 A parallel computation is called a 
SIMD computation if and only if |0.| = 1 for all i > 0; 


= oe a 


otherwise it is called a MIMD computation. 


Note that in the above example, since 10. | = 1 for all 


i, the parallel computation is a SIMD computation. 


Parallel algorithms can be measured by a variety of 
criteria (Cf. [Stone 1973] for alternative criteria). 
Primarily, we are interested in the rate of growth of the 
number of parallel operations required for solving very 
large instances of a problem. The number of parallel 
arithmetic and Boolean operations 9) required by a parallel 
algorithm expressed as an integer function of the attributes 


(8) Given any set A, JA| denotes the cardinality of A. 


(9) We ignore all overhead of memory access and data 
exchange (Kung 1974, Brent 1974, Chen & Kuck 1975, Kuck & 
Mucaoka 1975, Kuck 6& Maruyama 1975]. This assumption is 
justifiable because overhead is roughly proportional to the 
number of parallel operations performed and ignoring it 
affects the parallel complexity by a constant factor only. 
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(e-g-, the precision of an integer, the size of a vector, 
the order of a matrix, etc.) of the input arguments is 


called the parallel complexity of the parallel algorithn. 


Let f(n) be the parallel complexity of a parallel 
algorithm for solving a certain problem with a single 
attribute and g (n) be the parallel lower bound for solving 
the problem. The parallel algorithm is said to be optimal if 
and only if f(n) = g(n) + O(1) and asymptotically optimal if 
and only if £(n) = O(g(n)). 19) The (asymptotic) optimality 
of a parallel algorithm with more than one attribute can be 


Similarly defined. 


We emphasize that, just as is the case in alnost all 
research done in sequential computational complexity [Aho et 
al 1974, Borodin & Munro 1975], rather than attempting to 
provide the best practical parallel algorithms, the aim of 
this study is to gain insight into the intrinsic difficulty 
for solving various problems on SIMD parallel computers. 
This emphasis leads us to concentrate on the optimality and 
the asymptotic behavior of parallel algorithms. Thus, in 
certain instances, no matter how straightforward Or 
impractical the parallel algorithms are, they are still 
given and analyzed simply to show that these parallel 


algorithms are (asymptotically) optimal or that the parallel 


C10) Given two sequences (i-e., functions defined on 
integers) f(n) and g(n) such that g(n) 2 0 for all ny, we 
write f(n) = O(g(n)) if and only if there exists a constant 


K > 0 such that If(n)| < Keg(n) for all n. 
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lower bounds can be attained. Of course, in other instances, 
the parallel algorithms given are not only intricate but 


also practically useful. 
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1.3 Contributions 


The problems we are concerned with are the fundamental 
processes for doing different kinds of arithmetic including 
integer arithmetic, polynomial arithmetic and modular 


arithmetic. 


On sequential computers, the subject has been 
thoroughly covered in Chapter 4 of Knuth's book [1969]. On 
parallel computers, except for a few isolated results, very 


little research has been done. 


The main contributions of this thesis are efficient or 
(asymptotically) optimal parallel algorithms for the 
following problems: 

1) parallel integer arithmetic: 

a) addition or subtraction of n-digit integers can 
be computed in O(log n) parallel operations. 

b) multiplication of an m-digit integer by an n- 
digit integer can be computed in 0O(1l0g(m+tn)) 
parallel operations. 

c) division of an (mtn)-digit integer by an n-digit 
integer can be computed in O(log(mtn)) parallel 
operations. 

2) parallel modular arithmetic: 

a) polynomial interpolation at n points can be 
computed in O(log n) parallel operations. 

b) the Chinese remainder problem with n moduli can 
be solved in O(log n) parallel operations. 


c) full linear systems of order nm can be solved 
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exactly in O(log n) parallel operations. 
3) parallel polynomial arithmetic: 
a) division of a polynomial of degree mtn by a 
polynomial of degree n can be computed in 


O(log(m+1)) parallel operations. 


Other contributions include the derivation of tight 
parallel lower bounds for the evaluation of polynomials and 


elementary symmetric functions. 
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CHAPTER THO 


Presently, techniques for proving parallel lower bounds 
are not only scarce but also immature. Among the few 
ceported parallel lower bounds, results are either left 
unproved or proved in an ad hoc manner. More seriously, 
existing technigues fail to give tight parallel lower bounds 
for even very common problems. The purpose of this chapter 
is to discuss techniques for finding tighter parallel lower 


bounds. 


The fan-in argument, which gives a relationship between 
the number of input arguments, or the minimal number of 
sequential operations, and the minimal number of parallel 
Operations, is illustrated in Section 2.1; the growth 
argument, which determines a relationship between the growth 
of a result and the minimal number of parallel operations, 
in Section 2.2; and the substitution argument, which gives a 


relationship between classes of problems, in Section 2.3. 


Section 2.3 is concluded with an example to demonstrate 
how the three arguments are combined to obtain a tighter 


parallel lower bound. 
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2.1 Fan-in Argument 


Parailel computations are frequently represented by 
binary trees or, more generally, binary forests (i.e., 
collection of trees), in which the roots correspond to 
results, the leaves (or external nodes) correspond to input 
arguments, the (internal) nodes correspond to operations 
performed on previously available results, arguments, or 
constants and the height of the tree corresponds to the 
number of parallel operations performed. By this analogy, 
properties about binary trees can be immediately transformed 
into properties about parallel computations. The following 


fact about binary trees is well known. 


Fact. The height of a binary tree with n leaves is at 


least -log ny. (11) (212) 


Using this fact, a simple relationship between the 
number of input arguments and the minimal number of parallel 


operations required can be established. 


1 At least -log n, parallel operations are 


required to compute a result which depends on n arguments. 


The technigue, generally called the fan-in argument, 


was first adopted by Winograd [1965, 1967] to prove lower 


(11) Throughout this thesis all logarithms are taken base 2, 
unless otherwise stated. 


(12) Given any number x, ¢xX, denotes the smallest integer 2 
x, and txs denotes the greatest integer < x. 
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bounds for logical circuits. Variations of this lemma are 
either stated [Kuck 6& Muraoka 1974], proved (Munro 6 
Paterson 1973], oc simply used implicitly in many papers 
which deal with parallel computations. To illustrate the use 
of this lemma, we apply it to show some well-known 
fundamental results. The first is the summation of several 


numbers. 


Lemma 2.1.2 At least -log ny parallel operations are 
required to compute the sum of n numbers, 
E = x,t 25 oss e@ea + Xie 


in F(Xj¢ x) given F JU {Xj Koe sees x} 


Furthermore, this bound can be attained. 


Proof. Since E depends on n input arguments, by Lemma 
2.1.1, at least ¢log ny, parallel operations are reguired to 


compute E. 


We construct a computation tree for E such that { x; | 
1si<n } are at the bottom level, x, + Xoe x,t Xie ees arerat 
the next level, x,+ X5+t K+ X,, Xo+ Kot Kit Koy ee are at 
the still next level, and so on, until the highest level is 


E. Obviously, the height of the tree is slog ny. Thus, the 


parallel lower bound is attained. OcE.D. 


The parallel algocithm given in the above lemma is 
called the log-sum algorithm in the ILLIaAC IV literature. 


Similar parallel algorithm, called the log-product 


numbers. 
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inner products are common operations in linear algebra. 
Matcix multiplication, for example, can be regarded as the 
computation of many inner products simultaneously. The 
following theorem gives a parallel lower bound for computing 
inner products, and consequently, also for computing the 


product of two matrices. 


Theorem 2.1.3 At least -log ny+1 parallel operations 
are required to compute the inner product, 
SEB LARS aA tS HG 
in F(X). Yie eeee Xo YJ given F JQ {x} oO {y ,}- 


Furthermore, this bound can be attained. 


Proof. Since £— depends on 2n input arguments, by Lemma 


2-1.1, at least -log({2n), = yflog ny+1 parallel operations 


are required. 


The parallel algorithm to achieve the parallel lower 
bound is guite similar to the usual one. The n_ products, 
X;Yj0 18 i na, are computed in one parallel multiplication 
and the products are then summed in flog ny, parallel 


aiditions by the log-sum algorithn. Oc nels 


Corollary 2.1.4 At least rlog ny+1 parallel operations 
are required to compute matrix-matrix, matrix-vector, or 
vectoc-matrix multiplications of order n. Furthermore, this 


bound can be attained. 


Relationships between the number of sequential 


operations required and the number of parallel operations 
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required can be established by first Observing the following 


fact about binary trees. 


Fact. The number of internal nodes in a binary tree is 


one less than the number of leaves. 


Using this fact and Lemma 2.1.1, and also recalling the 
correspondence between binary trees and parallel 


computations, the following lemma follows immediately. 


required to compute a result which requires at least n 


sequential operations. (13) 


Provided that the sequential lower bound for certain 
problem is known, the above lemma immediately yields a 
parallel lower bound. In fact, an alternative proof for 
Theorem 2.1.3 uSing this idea appears in [Borodin & Munro 


1975]. 


The following lemma due to Kedem and Kirkpatrick [1974] 
gives the sequential lower bound for the summation of n 


numbers. 


(13) It is interesting to note the dual of Lemma 2.1.5: 


Lemma 2.1.5" At most 2°-1 sequential operations are 
required to compute a result which can be computed in n 
parallel operations. 


The lemma establishes a relationship between parallel upper 
bounds and sequential upper bounds, but we do not explore 
this direction any further. 
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2-6 At least n-1 additions are reguired to 


E = xa? a es Xn 


in P(X) Xoe ceee X) given F U {X) 0 X50 coer X}- 
Proof. Cf. Borodin and Munro [1975, pp. 12-15]. Q.E.D. 


Lemma 2.1.2 states that at least -ylog ny parallel 
Operations are required to compute E. But on many occasions 
we are more interested in the specific types of operations 
involved, i.e. additions, subtractions, multiplications, 
divisions, or Boolean operations. On a SIMD parallel 
computer, although many operations can be performed 
Simultaneously, only one type of operation can be performed 
at any time. This point is illustrated by proving a stronger 


version of Lemma 2.1.2. 


required to compute 
E = Xj + Xo + ooo + Xie 


in F(x, Xie wialets x) given F JU {x Xe aieteva x} 
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Proof. The theorem is an immediate consequence of Lemma 


221.5 and Lemma 2.1.6. QSE ade 
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2-2 Growth Argument 


The fan-in argument as illustrated in the previous 
section is a ubiquitous yet simple technique for proving 
parallel lower bounds. However, there are severe limitations 
in using it alone. Very often the bounds obtained from it 
are too trivial to be useful. For example, the fan-in 
argument can be used to prove the following result: 

At least log 1 = 0 parallel operation is required to 
compute x”. 


The result, however, is hardly useful. 


The growth argument can complement the fan-in argument 
for obtaining more useful parallel lower bounds. Unlike "eRe 
fan-in argument, it establishes a relationship between the 
growth of a result and the minimal number of parallel 
operations required. We proceed to illustrate the growth 


argument by proving the following lemma. 


1 At least -log n, parallel operations are 
required to compute x” in F(x) given F U {x}. Furthermore, 


this bound can be attained. (14) 


Proof. The parallel lower bound is proved by induction 
on n, the degree of x". First, the parallel lower bound is 


obviously true foc n = 1. Suppose that the lemma holds for n 


(14) The sequential complexity of this problem is flog ny + 
O({log n)/(log log n)) [Knuth 1969]. This is one example of 
a problem where there is very little advantage in solving it 


on a parallel computer. 
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[eee 
2, 1.e., k parallel operations are required. Since only 


< 
rational functions (including powers of x) of degree < ons 
can be computed, then any function computed at the (k+1)-st 
parallel operation can not exceed degree 2k+1, (15) Thus, 


the parallel lower bound holds for all n. 


The parallel lower bound is attained by using the log- 


product algorithn. Q.E.D. 


Theorem 2.2.2 At least -log n, parallel multiplications 


are required to compute x” in F[x] given F U {x}. 


Proof. Since in the ring F(x], deg(fxg) = deg(f) +deg(q) 
and deg(ftg) < {deg(f),deg(g)}. The only operation which can 


incur any increase in degree is multiplication. Q.E.D. 


Note that the theorem is not true if the intermediate 
results are not confined to F[x]. In fact, Kung [1974] 
presents a parallel algorithm in F(x) given F U {x} which 
computes x” in 2 parallel divisions and flog ny,+2 parallel 


additions and no multiplication (Cf. example in Chapter 1). 


(15) Any rational function f(x) in F(x) can be expressed as 
a formal quotient of two relatively prime polynomials f(x) = 
a(x) /b(x), and the degree of f is defined as deg(f) = max 


holds 
deg(f op g) < deg(f) + deg(g), 


for any rational functions f and g and where op is any of 
the operations +, -, *, and /. 
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2.3 Substitution Argument 


The results obtained from the fan-in argument and the 
growth argument in the last two sections are very Simple 
ones. For slightly more complicated problems the techniques 
are deemed insufficient. However, a simple idea which can 


boost the power of the previous techniques works as follows: 


Let A be any algorithm for solving a class of 
problems, P. If -we substitute some of the input 
arguments of the algorithm A with specific values, then 
A is transformed into another algorithm A" for solving 
a class of reduced problems, P*. But, since A* does not 
use more operations than A, any lower bound for P* must 


also be a lower bound for P. 


Thus, provided that the parallel lower bound for P' is 
Known, it immediately yields a parallel lower bound for P. 
The technique, generally called the substitution argument, 
is a widely used technique for proving sequential lower 
bounds {Borodin §& Munro 1975]. But the technique does not 
rely on any particular computation model, whether sequential 
oc parallel. It primarily establishes a relationship between 
two classes of problems. To illustrate the technique, we use 


it to prove the parallel lower bound for multiplying n 


numbers. 


Theorem 2.3.1 At least flog ny parallel multiplications 
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E = K, * Xo * eee * Kis 
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in F(X,e X54 «sese x Je gi yeniere <0 {X) + X50 sere X}- 


Furthermore, this bound can be attained. 
+ 


Proof. By the substitution argument, let x, < Xx, 1S 2 
< ne. The problem is then reduced to evaluating Kat which, by 


Theorem fe 2aly requires at least rlog ny parallel 


multiplications. 


The parallel lower bound is attained by using the log- 


product algorithn. Q.E.D. 


In order to obtain tight parallel lower bounds, one 
must frequently combine all three techniques. To demonstrate 
this point, we will prove a parallel lower bound for 
evaluating Boolean recurrence equations which will be needed 


in Chapter 4. 


Consider the following first-order Boolean recurrence 
problem. Given any 2nt+1 Boolean constants {ane Ale seen Ale 
Die ceee Bb} evaluate the Boolean variables { x; | OSis<n } 
defined by 

0 Oy 
K,=a,+b,° x%,)- 112i a2. 
where the operators *'+* and ‘e' are naturally interpreted as 
being Boolean addition and amultiplication, respectively. 
First x, can be expanded as follows: 
KX, = a,+(b,)a;_,+* eeet(b,b,_) ---b,) a, (b; D;_) ---b)) ao 
We observe that x is a sum of it1 products. An obvious 


parallel algorithm for evaluating { x; | OSisn } is first to 


form the products in rlog (n+1) 4 parallel Boolean 
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multiplications. This is possible because no product has 
more than n+1 atoms and a Boolean version of the log-product 
algorithm will do. Next, we can form all the sums of 
products in -log(n+1), parallel Boolean additions by a 


Boolean version of the log-sum algorithm. (16) 


Since there are 2nt1 input arguments, by the fan-in 
argument, at least -log(2nt+1), < -flog(nt+1),4+1 parallel 
Boolean operations are required. On the other hand, since 
each term in any x, Contains no more than nt+1 atoms, by the 
growth argument, at least -log(nt1), parallel Boolean 
Operations are required. Thus, neither the fan-in argument 
nor the growth argument can show that the number of Boolean 
Operations required for the above parallel algorithm is 


optimal. 


Theorem 2.3.2 At least -flog(nt1), parallel Boolean 


additions and -log(nt+1), parallel Boolean multiplications 
are required to evaluate {f{ x, { O<i<n } on SIMD parallel 
computers (17), if no other Boolean operations except 
addition and multiplication are permitted. Furthermore, this 


bound can be attained. 


(16) This parallel algorithm uses O(n%) processors but 
another parallel algorithm based on the binary splitting 
technique uses only nt¢1 processors [Kogge 1974]. 


(17) On MIMD parallel computers, evaluation of Boolean 
recurrence equations has been studied by Brent [1970], and 
evaluation of general Boolean expressions by Preparata and 
Muller [1976]. 
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(o 


roof. First, substitute all the bi "s by the Boolean 
value 1 whereby evaluating x, 1S reduced to evaluating a ee 
an-1 +t e++ # a; # ap. This, by the fan-in argument, requires 
at least rlog (n+1) 4 parallel Boolean operations. 
Furthermore, by the growth argument, since any Boolean 
multiplication would increase the number of atoms in a tern, 


all of these operations must be Boolean additions. 


Next, substitute all the a,'S, except aye by the 
Boolean value 0 whereby evaluating x, is reduced to 
evaluating the single term b, b,j --- by aoe By a Similar 
reasoning, this requires at least -flog(nt1)y4 parallel 


Boolean multiplications, sinc2 any Boolean addition would 


increase the number of terms. 


Finally, by the definition of SIMD computation, Roolean 
additions and multiplications can not be performed 
Simultaneously. Thus, the parallel lower bound is proved. 


Q-E.D. 


Note first that the fan-in argument, growth argument 
and substitution argument are all employed in the above 
proof. Also, a restriction is placed on the types of Boolean 
operations permitted. In fact, the last theorem is contrived 
strictly for illustration purposes. If arbitrary Boolean 
operations can be used, then Boolean aiditions or Boolean 
multiplications are not needed at all. For example, the NAND 


(oc NOR) operation can be used exclusively to express any 


Boolean function. 
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In this chapter we consider univariate and multivariate 
polynomials with coefficients in F. On the one hand, 
polynomials may be viewed as being functions which can be 
evaluated, and on the other, as being algebraic entities 
which may themselves be added, subtracted, multiplied and 


divided. Both points of view are considered in this chapter. 


The evaluation of polynomials is considered in Section 
3.1. A tight parallel lower bound for this problem is given. 
From the proof it follows that a very early parallel 
polynomial evaluation algorithm designed by Estrin is 


optimal on SIMD parallel computers. 


In Section 3.2 some Simple polynomial arithmetic 
including addition, subtraction and multiplication is 
considered. Parallel algorithms for solving these problems 
are presented and their optimality proved. Most of these 
results are trivial and are included primarily for future 


references. 


Division of polynomials is considered in Section 3.3. 
An efficient parallel algorithm for solving this problem is 
presented. A parallel lower bound for polynomial division, 
which is of the same complexity as the parallel lower bound 


for polynomial multiplication, is derived. 


In the last section, Section 3.4, we consider the 
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evaluation of elementary symmetric functions, which finds 
its application in parallel polynomial nodular arithmetic in 
Chapter 5. An efficient parallel algorithm for evaluating 
all elementary symmetric functions is presented and its 


optimality proved. 
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3.1 Evaluation of Polynomials 


Given a polynomial of degree n 
f(x) = E Qt f\x Se fixe 
the optimal sequential algorithm for evaluating f(x) is the 
well-known Horner's rule which requires n additions and n 
multiplications. In fact, this algorithm is uniguely optimal 


on sequential computers [Borodin 1971}. 


The first parallel algorithm for evaluating polynomials 
was designed by Estrin [1960]. It uses the binary splitting 
technique, namely, it first computes 

fot £) Xe £, + £ 3X, sees X* 5 
and then rc eee nes Ce toes) ee ean 2h 
weley and SO On. 
This parallel algorithm requires -log(nt1), parallel 
additions and -log(nt?1) , parallel multiplications, assuming 


tL(n+1)/25 processors are available. 


Various other parallel algorithms are developed by Dorn 
[1962], Munro and Paterson [1973], Maruyama [1973] and Kung 
(1974]. Dorn's algocithm is essentially a generalization of 
Horner's rule. Munro and Paterson's algorithm and Marcuyama's 


algorithm both assume a MIMD computational model. 


In this section it is shown that, if division is not 
allowed, at least -glog(nt1), parallel additions and 
rlog(nt1)4 parallel multiplications are required to evaluate 
f(x) on SIMD parallel computers. Thus, on SIMD parallel 


computers Estrin's oldest parallel algorithm is in fact 
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Optimal (in F{x, fo « fie eee, £,])- This is probably one 
good example of why a systematic approach to deriving 


parallel lower bounds is important. 


Theorem 3.1.1 At least -log(nt+1), parallel additions 
and rlog(n+1), parallel multiplications are required to 
evaluate f(x) in Fx, foe Ele cece f£ J given F U {x, foe fi. 


soay fi on SIMD parallel computers. 


Proof. Using the substitution argument, first consider 
the case x <- 1. The problem is reduced to the summation of 
the nt1 indeterminates, fot f+ oie ait f.- By Theorem 2.1.7, 
at least slog(nt1), parallel additions are reguired to do 


the Summation. 


Next, consider the case fs <>) Xeand) £7 <-—) 0,7 8toe all i 
< n. The problem is now reduced to that of evaluating the 
power x1, By Theorem 2.2.2, at least -log(nt1), parallel 


multiplications are required to do the evaluation in F[KX, 


foe fie cece f i- 


Finally, by the definition of SIMD computation, 
parallel additions and multiplications can not be performed 
Simultaneously. Thus, at least -rlog(nt+1), parallel additions 


and -log(nt+1), parallel multiplications are required. Q.E.D. 


Unlike Horner's rule for sequential computation, 
however, Estrin's parallel algorithm is not uniguely 
optimal. For example, a different parallel algorithm with 
the same operation bound can be obtained as follows: First, 


all the nt1 terms of f(x) can be computed in rlog(nt1), 
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parallel multiplications and then the terms can be added in 


rlog (n+1)4 parallel additions. 


In addition, if computation is not restricted to F[K, 
fo « fie assy f de then a parallel algorithm which requires 6 
parallel multiplications / divisions and 2rlog ny + 6 


parallel additions can be designed [Kung 1974]. 
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3.2 Simple Polynomial Arithmetic 


In this section we discuss some Simple polynomial 
operations, including addition, subtraction and 
multiplication of polynomials. Some of the results are 
trivial, and are included for future reference and for the 


sake of completeness. 


The Simplest Operations on polynomials are 
multiplication and division of a polynomial, say f, by a 
(positive) power of x, i.e., fex* and f/x*. These operations 
can be regarded as being "scaling" of polynomials, just as 
integers can be scaled by powers of b, where bis the base 
(or cadix) [Cf. Aho et al 1974). The highest (n+1)-st 
coefficients of fxex* are exactly the same as those of f, 
whereas the remaining k lower order coefficients are all 
zeros. The coefficients of the quotient of f/x* are the same 
as the first n-k+1 coefficients of f and the coefficients of 


the remainder of £/x* are exactly the lowest k terms of f. 


There may be hidden costs associated with the 
multiplication and division by powers of x (as far as 
programming is concerned) but there are no explicit 
arithmetic operations performed. Indeed, when polynomials 
are scaled as part of a more complex process, the cost of 
this scaling will normally be negligible and can usually be 


ignored. 


Multiplication and division of a polynomial by a 


constant polynomial are also easy to perform. fhe 
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multiplication (or division) of a polynomial of degree n, 
Sileie sce tt x te seweee tex, 
by a constant polynomial c is the multiplication (or 


division) of its coefficients by c. 


Theorem 3.2.1 The multiplication (or division) of the 
polynomial f£ by a constant polynomial c can be computed in 1 
parallel multiplication (or division) on a SIMD parallel 
computer with nti processors. Furthermore, the parallel 


operation bound is optimal. 


Proof. The parallel operation bound and _ processor 


requirements are obvious. 


pancesf£-* ¢ (or £;:7 %c)-is notoine’ Us{t 30) {ch sat 
least one parallel operation is required. Thus, the parallel 


Operation bound is optimal. Q. 8. D. 


The addition (oc subtraction) of two polynomials 
£(z) = £)¢ £755 * sect £.x" 
and G(X) = dyt 9]X + eo + gx" 
is the addition {or subtraction) of their corresponding 


coefficients. 


Theorem 3.2.2 The addition (or subtraction) of the 
polynomials f and g can be computed in 1 parallel addition 
(oc subtraction) on a SIMD parallel computer with n+1 


processors. Furthermore, the parallel operation bound is 


optimal. 


Proof. The parallel operation bound and processor 
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requirements are obvious. 


Since f+ g, (oc fo>39 iis 10 tine Feu (f,) U {gp ee tat 
least one parallel operation is required. Thus, the parallel 


operation bound is optimal. Oe nels 


The multiplication of general polynomials is also quite 
straightforward. Given two polynomials of degree m and n 
f(x) = Ejt £5x + <2.) * foxe 
= n 
and g (x) dot g,x re eee: g Xs 
their product (often called the convolution product ) is a 
polynomial of degree mtn 
h(x) ="h © how +524 bh » et 2 irs 
0 i m+n 
where h,; = £49,* £i95,_4+ erevet it £9 oe 0 < i < mtn. Note that 
f (or 5.) is tacitly assumed to be zero if k < 0 or k>o 
(or k > n). 


Fast sequential polynomial multiplication algorithms 
ace based on the Fast Fourier Transform (FFT) and are of 
complexity O(n log n) [Aho et al 1974, Borodin & Munro 
1975]. These algorithms are much more complicated than the 
classical multiply-and-add algorithm. [It is therefore 
important to note that on parallel computers the classical 


polynomial multiplication algorithm is indeed optimal. 


3 If m 2 n, then the product of the two 
polynomials £ and g can be computed in 1 parallel 
multiplication and -log(nt1)4 parallel additions on a SIMD 
parallel computer with (m+1) (n+1) processors. Furthermore, 


the parallel operation bound is optimal. 
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Proof. First compute the products, ee OE< 39S. a, 07S 
j < ne, in one parallel multiplication with (m1) (n+1) 
processors. Then compute the coefficients h = f£,9,+ ok: oe 
e-- + £;9, by the log-sum algorithm. Since re: re 0 
whenever i- k > n, there are at most nt1 nonzero summands 
in each ho 0 <i < mtn. Thus, the coefficients can be added 
in rlog(n+1), parallel additions. Obviously, (m+1) (n+#1) 


processors are sufficient for doing the summations. 


Note that he. may be viewed as being the inner product 
of two (n+1)-vectors. BY Theorem 2.163, “at “least 
rlog(nt+1)4+1 parallel operations are required to compute 
just h,. Thus, the parallel operation bound is optimal. 


Q.E.D. 
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3.3 Division of Polynomials 


Given two polynomials, one of degree m+n 
f(x) = £\+* £.x Be Se 0 LS peKees 
and one of degree n, n > 1, (128) 
G(X) = dot GK +... + Gx’, 
there exist unique polynomials gq and r such that 
f=q*g+r, deg(r) < deg(g), 
Where q 1S a polynomial of degree a, 


In 
xX = + x + eee + e 
g (x) a" a qx 


The polynomial q is called the quotient and denoted by 
For any integer k > 0, the quotient txk*#/7gs is called the 


reciprocal of g of degree k. (19) 


The existence of gq and r can be proved constructively 
by ‘long division', i.e., a multiple of g is successively 
subtracted from f until the final remaining polyaoatal has 
degree less than that of g. If there are nt+1 processors 
available, then the number of parallel operations required 
by long division is proportional to m+t1, the number of 
coefficients in the quotient q. Note particulacly that 


explicit division of the coefficients is performed only by 


€18) Since we have discussed division by constant 
polynomials and powers of x in Section 3.2, we assume that 
the divisors are non-trivial and of degree 2 1 throughout 


this section. 


(19) This definition is Slightly more general than that 
given in [Aho et al 1974]. 
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J,s So that if g is monic, no division is actually 


performed. 


A faster parallel polynomial division algorithm can be 
obtained from the product of f and the reciprocal of g (29), 
First, observe that it suffices to compute the quotient de 
Since the remainder rc can be obtained in one polynomial 
multiplication and one polynomial subtraction. Next, only 
the first m+1 coefficients of f and g are relevant to the 
computation of q. More precisely, the quotient gq is given by 
the following formula: 

q = C(Lf/xOs ety MeN 7gs) /xM5, C21) 
iee., the coefficients of q are the first mt+1 coefficients 
of the product obtained by multiplying the first m+1 


coefficients of f with the reciprocal of g of degree n. 


Examples. Let g = 2xt1. Denote the reciprocal of g of 


degree k by uk). Then, u€9) = 1/72, ult) = x/2-1/4, 


u€3) = x3/2-x2/4+x/8-1/16. 


1. For f = 3x+1, let 
v= Cf/xixul9) = 3x(1/2) = 3/2. 


Then 


eee ee iw ee i a eS 


C20) In fact, if the divisor will be used more than once 


future use. This is called precomputed division in [Kung 
1973]. 


(22) This formulation is equivalent to those of [Kung 1973] 
and [Borodin & Munro 1975]. 
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gq = tv/x0s = 8725 


2. For £ = 2x2+3x+3, let 


vV = &f/xtxuCt) = (2x43) (x/2-1/4) = x2+x-3/4. 


gq = tv/x! = xt. 


3. For f = x4t5x2+3x+1, let 
v = lCF/xtxul3)d) = (x345x+3) (x3/2-x2/44x/8-1/16) 


= x6/2-xS/4+21x*/8+3x35/16-x2/8+x/16-3/16. 


g = tv/x3! = x3 /2-x2/4421x/8+3/16. 


The reciprocal &x™*"/qi of degree m can be computed 
iteratively as follows [Aho et al 1974}: 

The reciprocal of degree 0 is given trivially by 

aE GWA he 
Having obtained u, the reciprocal of degree k=(21-1)-1, 
then, uwu*, the reciprocal of degree 2k+1 = Que can 
be computed by means of 

v <- 2ux3 kei - u2tg/xn-2k-14, 


ut <- ty/x2ks, 


Examples. Let g = x?-x6#x5+2x*-x3-3x2+xt4, Denote the 


reciprocal of g of degree k by uCk>, then, 
uC0) = 1/g, = Ie 
To find ult): 


2ucodx2 - (u(9))2tq/x6s 


v 


u(t) = tv/x0s = v = xtt. 
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To find ul3); 


V = 2uCtdx* - (uCtd)2tgsx4s 
= 2(x+1)x* - (x4#1)2 (x3-x2+x#2) 
= x5+x*-3x2-5x-2 

uC3) = ly/x2s = yx 342-3, 


To find uC7); 


Vv = 2u€3)dx20 - (yC39)2tg/xos 


XPS 4x12-3x10-4x94+3K%84+15x7412x6 
~42x5-34x4+39x3+51x2-9x-36. 
uC7) = by/xes = ¥74+xX6-3x%-4X34+3K24+15x412. 


And so on. 


Theorem 3.3.1 The reciprocal of g of degree m can be 
computed in O{(,-log(mt1),,-log(n+1)4) parallel additions / 
subtractions and O(rlog(mt1)4) parallel multiplications / 
divisions on a SIMD parallel computer with (mt1) (n+1) 


processors. 


Proof. The iterative formula for computing the 
reciprocals of g, at each stage, entails two polynomial 
multiplications and one polynomial subtraction. Thus, the 
number of parallel operations tT (2<=m+1) required to compute 
the the reciprocal satisfies, at the i-th stage, 

g(2t) < r(2 #1) + 2 multiplications (1a) 
+t 2-log(nt#1), additions + 1 subtraction. 
In particular, 
T(2°) = 1 division. (1b) 
From the recurrence relation (1), we conclude that the 


reciprocal of degree m can be computed in 
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0 (log (m+ 1) 4 -log(n+t+1) 4) additions / subtractions and 


O(rlog(mt1)4) multiplications / divisions. 


Within the parallel algorithm, the process uSing the 
greatest number of processors is polynomial multiplication. 
At the i-th stage, it is easy to observe that at most 
2? (n+1) processocs (where 2i< m+1) are used. Thus, 
(m+1) (n+1) processors are sufficient for the entire 


computation. OLE. D. 


Corollary 3.3.2 The quotient gq and remainder r of f 
divided by g can be computed in O(-log(mt1) 4 -log(n+t1) 4) 
parallel operations on a SIMD parallel computer with 


(m+1) (n#1) processors. 


We now derive a parallel lower bound for polynomial 
division. Without loss of generality, assume that the 
divisor g iS monic. From the identity f = q * g + r, it then 


follows that the coefficients of gq are given by the linear 


recurrence equation: 


qn oa mtn 
m -L . m+n—-1 ~ 9 aeion 
= — pee eco i < Me 
eet = mt+n-i Seah a eee ah f. 


Observe that q ., 0 € i < m, is a polynomial of degree i in 
kL. 


g - In particular, ay is a polynomial of degree m in es 
n-1l = 


Theorem 3-3-3 At least -log(mt1), parallel additions 


and -log(m+1), parallel multiplications are required to 


compute the quotient gq in Flt, coor fin’ Ghee efor 
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given F U {f 3 U {9} on SIMD parallel computers. 


Proof. €22) Since fig a polynomial of degree om, by 
Theorem 3.1.1, to evaluate it alone requires at least 
rlog (m+1) 4 parallel additions and -log(m+1), parallel 


multiplications in F Sas 
p ions in F[f.. pf aw So 


Sonera Uersi| ae eae 


0” n-1 

Note that the parallel lower bound can be attained only 
when g is of the first degree. In Chapter 5, however, we 
give a more efficient parallel algorithm (which depends on 
results yet to follow) for all an. The parallel algorithm 
given there is not practical since it assumes exponentially 
bounded parallelism. However, in the case of unbounded 
parallelism at least, we will have shown that polynomial 


multiplication and division are of the same parallel 


complexity. 


(22) An alternative proof can be given by using the 
substitution argument. 
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Let s(n,k), 0 < k <n, be the elementary symmetric 
function of degree k in n indeterminates { y | 1<i<a }. The 
elementary symmetric functions { s(n, k) | O<k<n } are 
formally defined to satisfy: 


h (x) 


(xty,) (xty,) +2. (x+y) 

= x7” # s(n,1)x™-1 + ...4+ s(n,n-1)x + s(n,n).~ 
(Note that s(n,0) is always 1 regardless of the value of n.) 
Each s(n,k) is the sum of all k-combinations (or products) 
formed from the n indeterminates. Therefore the number of 
products in s{(n,k) is C(n,k) = n!/k!(n-k)!. Moreover, 
s(n, *n/24) has the greatest number of products, C(n,*n/24) = 
O23 products. Therefore, if the elementary symmetric 
functions are computed as the summation of products, then at 


least O(n) parallel opeations are required, which on 


parallel computers is deemed inefficient. 


A direct expansion of h(x) leads to a more efficient 
parallel algorithm. If h is expanded by forming products of 
2 factors, then products of 4 factors, and so on 623), then, 
using the polynomial multiplication method discussed in 
Section 2, the number of parallel operations required is 
O(log2n) on a parallel computer with n? processors. In fact, 
this parallel operation bound is the best previously 


reported result [Goyal 1975]. In Theorem 3.4.2 below, we 


ee ee a ee er a a 


(23) Products formed in this manner are called supermoduli 
in {Moenck & Borodin 1972]. 
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give a parallel algorithm which requires only O(log n) 


parallel operations. 


Lemma 3.4.1 Given n, the set of powers x27. °K3 Fees 
x"} can be computed in flog ny parallel multiplications on a 


SIMD parallel computer with n/24 processors. 
Proof. [Borodin & Munro 1975, p. 128]. Q.E.D. 


The following theorem, due to Csanky [1974], gives an 
efficient parallel algorithm for computing the elementary 
Symmetric functions. In his development, however, Csanky 


ignores the cost of computing powers of the Fourier points. 


Theoren 3 2 The elementary symmetric functions 
{ s(n,k) { O<k<n } of n indeterminates { 1 { t<i<n } can be 
computed in Lee oe seals a. given F JU {Y,« You cor vy} in 
rlog (nt 1)4+1 parallel additions, rlog n,+1 parallel 
multiplications and 1 parallel scalar division on a SIMD 


parallel computer with (n+1)2 processors. Furthermore, the 


parallel operation bound is optimal. 


Proof. First, evaluate h(x) and {x2, x3, ..., x } at 
the nt? Fourier points {wi}, 0 < i <n, where wisa 
primitive (nt+1)-st root of unity. Obviously, this can be 
done in 1 parallel addition (n(n+1) processors) and log ny 
parallel multiplications (2(n+1) ‘n/24 processors). Next, the 


elementary symmetric functions are obtained from the 


following formula: 


s(n,k) =[ h(wipxw-iGmk)] 7 (n+1), 0 ¢ k <a. 
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This can be computed in 1 parallel multiplication ((n+1) 2 
processors), -log(nt+1), parallel additions ((n+1) &(n+1) /23 


processors) and 1 parallel scalar division (n+1 processors). 


Since s(n,1) = 7 Vey et Pee Y« by Theorem 2.1.7, at 
least rlog ny parallel additions are required. Similarly, 
Since s(n,n} = Vey Gases ey ce by Theorem 2.3.1 at least 
clog ny parallel multiplications are required. Thus, by the 
definition of SIMD computation, the optimality follows. 


OED. 


Note that the above parallel algorithm is based on 
discrete Fourier transforms which are special cases of 
evaluation and interpolation (Cf. Chapter 5) since the 
Pourier points {wi}, 0 < i < an, are special. Although 
discrete Fourier transforms in general require 2,l9g(nt1)4 
parallel operations to compute (Cf. Section 1 on evaluation 
of polynomials), the evaluation and interpolation in the 
above theorem are made especially easy due to the fact that 
h(x) is already factorized and that the powers of {wi} can 


be computed before the evaluation of s(n,k) begins. 


In addition, the above algorithm is not suitable for 
sequential computation, since it reguires O(n?) sequential 
operations. The best sequential algorithm requires only 


O(n log2n) operations {Borodin & Munro 1975]. 
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Denote an n-digit, base-b (where b > 2) integer x by 


nh 
Ki=G2) X(i) eub sje. s xs) <sbo1n 


1 


The integer x given in the above form is said to be in 


In this chapter we study the parallel complexity of 
performing the following processes on SIMD parallel 
computers: 

1) addition or subtraction of n-digit integers, giving 

an n-digit result plus a possible carry; 

2) multiplication of an m-digit integer by an n-digit 

integer, giving an (mtnj-digit product; 

3) division of an (mtn)-digit integer by an n-digit 

integer, giving an (m+1)-digit quotient and an n- 


digit remainder. 


We assume that the following (binary) operations are 
primitive: 
1) addition or subtraction of 1-digit integers, giving 
a 1-digit result and a carry; 
2) multiplication of a 1-digit integer by another 1- 
digit integer, giving a 2-digit product; 
3) Boolean addition (OR-ing) oF multiplication (AND- 


ing) of two bits, giving a 11-bit result. 
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Thus, for example, if x and y are 1-digit integers, 


then it is assumed that addition yields the values 


n 
ul 


(x + y) mod b, 


and eee (ey) /e bere 


In Section 4.1, a parallel algorithm for performing 
integer addition is considered. The concept of pseudo- 
positional integers (integers whose digits are bounded by 
2b-2 rather than b-1) is introduced. Using this concept, 
integer addition is reduced to conversion of integer from 


pseudo-positional to positional notation. 


In Section 4.2, a parallel algorithm for performing 
integer multiplication is described. Pseudo-addition is 
crucial to the design of an efficient parallel integer 


multiplication algorithn. 


In Section 4.3, parallel algorithms for performing 
integer division are considered. A tradeoff between the 
number of parallel operations required and the number of 


processors used is observed. 


For all problems efficient or asymptotically optimal 


parallel algorithms are given. 
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4.1 Addition and Subtraction of Integers 


Given two non-negative n-digit integers in positional 


notation, 


nh 

Re ele bee 0a x) soho, 
n . 

Vitals yey (10H bg ay bad; 


their sum is an n-digit result plus a possible carry 


nN 
Sas (Lee eb 38,0 es es (1) sah 1 


i=0 


Let c(i) denote the carry at the i-th position. Then, 


s(n) = { x(n) + y(n) ] mod b, (1a) 
en rece exe (np ey (0) 7 be, (1b) 
and for 1 = n-1, «.., 2, 1, S({i) can be computed iteratively 


by means of: 


S (1) [ x(i) + y(i) + c(it1) J] mod b, (1c) 


© [> xG) + y(t) + c(zt1) J 7 b 4 (14) 


c(i) 
The O0-th digit, s(0), is given by 


s(0) = c(1).- (Te) 


This iterative addition  aalgorithn requires O(n) 


parallel operations and therefore is too inefficient on 


parallel computers. 


On the other hand, since polynomial addition is quite 


cheap, the sum s can be computed as if x and y were 


polynomials in b: (24) 
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n . 
S= 2% wi) « bat, 


where w(i) = x{(i) + y(iy, 1 <5 i < on. Note, in particular, 
that the coefficients of the polynomial sum are bounded by 
2b-2 rather than b-1. This motivates the notion of pseudo- 


positional integers: 


Definition 4.1.1 Let w be a positive integer given by 
n . 
WS y ww) x bo 7) 0 <4 (<5 2b-2. 


The integer w given in the above form is said to be in 


2S See —_ a ore Se ee ee oe <p = em ee ee ee ee 


Thus, addition of integers is reduced to the conversion 
of integers from pseudo-positional notation to positional 
notation. This Switching of views is particularly 
fundamental in the design of efficient parallel addition and 
multiplication algorithms discussed in sebsequent Sections. 


It is then this problem to which we now turn attention. 


Since all pseudo-positional integers appearing in this 
chapter are obtained as intermediate results, it is assumed 


that both digits in the representation of each w(i), 


ee ee ws ee ee oe a ee ee ee 


C24) As a matter of fact, all arithmetic on multiple- 
precision integers can be viewed as being polynomial 
acithmetic [Aho et al 1974, Borodin & Munro 1975]. However, 
the correct cesults must be recovered from the polynomial 
results by releasing the carries [Borodin & Munro 1975, pp. 


90-91]. 
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S'(i) = w(i) mod b, 


c! (i) Otw (2)\) /2be!s,. eset <ane 

are available without any additional cost. The iterative 
addition algorithm (1) given above then can be interpreted 
as being a conversion algorithm: 


S(n) = s"(n), (2a) 


c (n) c'(n), (2b) 
for i = n-1, eeog Ze ie 
S(i) = [ s*(i) + c(it1) J] mod b, (2c) 


c'(i) + © s'(i) + c(it1) J 7 b4, (24) 


" 


c (i) 


and Ss (0) etl. (2e) 


This conversion algorithn is still inherently 
seguential due to its iterative nature. The first key to a 
fast parallel conversion algorithm is the observation that 
at most one carry is possible in each digit position during 
the addition of positional integers {Knuth 1969, p. 231]. 
This observation is rephrased in more general form in the 


lemma below. 


2 Given any pseudo-positional integer, during 
the conversion to positional notation, the carry at each 


position is either 0 or 1. 


The vector of carries c = (c(1), c(2), «--, C(m)) can 


therefore be viewed as being a Boolean vector. 


The next observation is that, by formulae (2b) and 
(24), c(i) implies c(i), 1 <i S$ n, but not conversely. In 


other words, the vector c' is Boolean and indicates a carry 
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generating condition. Furthermore, by formula (2d), a carry 
at the (it+1)-st position, i.e., c(it1) = 1, will propagate 
to the next position if and only if 
S* (i) + c(it1)-= st (i) # 1 = b: 
or, equivalently, if and only if 
s*(i) = b-1. 
Denote this carry propagating condition as a Boolean vector 
P(i)>= 1 4£f ¢s' (i) = b=1,, 1 = 20 = n=-1. 
Then, formulae (2b) and (2d) can be rewritten as: 
c(n) = c! (n) (3a) 
c(i) = c'(i) + pi) © c(it1), 1 < i < n-1. (3b) 
That is, by using two auxiliary Boolean vectors, the carries 
{ c(i) | 1Si<n } have been shown to satisfy a Boolean first- 


order recurrence equation. 


3 The conversion of an n-digit pseudo- 
positional integer to its positional notation can be 
computed in 2,rlog n, parallel Boolean operations and 1 
parallel addition on a SIMD parallel computer with an 


processors. 


Proof. BY Theorem 2.3.2, the Boolean recurrence 


equation (3) can be computed in 2rlog nm parallel Boolean 


Operations with n processors. 


Once the carries { c(i) {| 2<i<n } are available, by 
formula (2c), the digits { s(i) | 1SiSn-1 } can be computed 
in 1 parallel addition (n-1 processors). On bade 


Theoren 4.1.4 The sum of two non-negative n-digit 
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integers can be computed in 2 parallel additions and 
2rlog ny parallel Boolean operations on a SIMD parallel 
computer with n processors. Furthermore, the parallel 


operation bound is asymptotically optimal. 


Io 


roof. By Theorem 3.2.2, the polynomial sum of two 
integers can be computed in 1 parallel addition (n 
processors). Since the polynomial sum is a pseudo-positional 
integer, by Lemma 4,1.3, it can be converted to positional 
notation in 2,rlog ny parallel Boolean operations and 1 


parallel addition. 


Since s(0) is a function of the 2n inputs { x(i), y(i) 
{i<i<n 3}, by the fan-in argument, at least ,-log(2n), = 
rlog ny+1 parallel operations are needed to compute just 


s(0). Thus, the asymptotic optimality follows. O.E.D. 


It is important to note that although the techniques 
used in this section is purely number theoretical, the 
parallel addition algorithm developed above includes the 
carry look-ahead adder {Flores 1963], which is derived from 
switching function theory, as a special case (i.-e., when b = 


2). 
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NOTE: The above results can eaSily be extended to 


include the addition or subtraction of negative integers. 


For example, the true complement (i.e., b's complement) of 


any n-digit integer can be computed in 1 parallel single- 


precision subtraction and 1 n-precision addition. (25) (26) 


(25) The sign-digit of the b's complement of any integer 
must be interpreted and operated on as a binary digit. 


(26) Overflow is determined by the usual rule that if the 
carry out of the sign-digit position and the carry out of 
the first digit position disagree, then an overflow occurs. 
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4.2 Multiplication of Integers 


Given two non-negative integers, namely, an m-digit 


multiplicand 


at 
fT] 
Hs 
img 
bo 


x(i) * bmi, 0 < x(i) < b-1, 


and an n-digit multiplier 


19) 6 
y = 1 y(i) * bm1, 0 < y(i) < b-1, 
their product is an (mtn)-digit result 
m+n : F 
Ww a w(t) * bmlen-2 0 < wi) s p=1-. 
ae = 


The classical paper-and-pencil algorithm for computing 
w is first to compute the n partial products 
P(j) = x *# y(jbe 1S 4 Sn. 
Each partial product p(j) is then normally converted to 


positional notation 
p(j) = a p(iej) * bt-i, 0 £ pli, j) ¢§ b-1. 
1= 
Next, the sum of the scaled partial products yields 


w= i p(j) * bn-j. 


Since frequent conversions to positional notation are 


expensive, the strategy should be to avoid the use of 


positional integers during the calculation of intermediate 
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results, thereby postponing conversions to _ positional 
integers until the last Stage of the computation. This 
consideration leads us to take a moce Serious look at 


pseudo-positional arithmetic. 


2-1 Define the pseudo-sum of two pseudo- 


nh : r 
a H(L) “* bo=L, 70 =<) x (i = 2b-—2, 


and y y(i) « bn-i, 0 < y(i) < 2b-2, 


i] 
Imps 
fH 


to be the pseudo-positional integer 


nN 
s = oe) «Die < Ses eZb=2-, 
1= 


The process of obtaining the pseudo-sum s is called 


Algorithm P. (Pseudo-Addition) 


P1. Set s(0) <- 0, 
s(i) <- x(i) + y(i), 18 1 S 0. 
P2. If s(i) < 2b-2, 0 < i < n, then terminate. 
P3. Otherwise, release one round of carries: 
(Tetec’ (a) & s({i) 7 b4 


and s*(i) = s(i) mod b) 


Set s(n) <- S'(n), 
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S(i) <- s* (i) +5c' (i411), 0 < 2 <n-1 


Go to step P2. 


Lemma 4 


2-2 The pseudo-addition of two pseudo- 
poSitional integers can be computed in at most 3 parallel 
additions (the first of which may be double-precision 
addition) on a SIMD parallel computer with n+1 processors. 
Furthermore, if b 2 3, only 2 parallel additions are 


sufficient. 


Proof. At the end of step P1, 
0 < s(i) = x(i) + y(i) < 4b-4. 
Thus, 
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Or, equivalently, 


oY (1)24-b, if b = 2, 
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b-1, alge b Le sie 


We next examine the number of times the loop P2 - P3 
must be repeated. The first time through P3 yields 
s(100 < (b=1)0 + Cl 41) = 2b— 172k bez, 


<22b—2, it Dee. 


Thus if b > 3, only one iteration of steps P2 - P3 is 
sufficient. 
Tf b = 2, one more loop through P2 - P3 results in 


(notecthatic%(i} 1s 1 = b=-1) 


s(i) < (b-1) + (b-1) = 2b-2. 


Thus, for b = 2, at most two iterations of loop P2 - P3 are 
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sufficient. 


Now since step P1 can be computed in 1 parallel 
addition (double-precision, according to the definition of 
primitive addition given previously), and step P3 can be 
computed in 1 parallel (single-precision) addition, the 


lemma follows. Q.E.D. 


Note that the number of parallel operations required to 
compute the pseudo-sum is a constant regardless of how many 
digits the pseudo-positional integers have. Thus, by using 
pseudo-addition, not only is the growth of the intermediate 
results during multiplication controlled, but also the cost 


of combining the partial products is reduced. 


The fastest sequential integer multiplication algorithm 
is one given by Schoenhage and Strassen [1971] and is of 
complexity O(n log n log log nj). Basically it uses the idea 
of discrete Fourier transforms. A disadvantage of the 
algorithm is that it is difficult to code and its benefits 
are realized only for n sufficiently large (Knuth 1969, Aho 


et al 1974]. 


The fastest previous known parallel integer 
multiplication algorithm is one given by Atrubin [1965] and 
is of parallel complexity O(n) on a linear iterative array 
with n modules (Knuth 1969}. The parallel integer 
multiplication algorithm given below is essentially a 
parallelized version of the classical paper-and-pencil 


algorithm. It differs primarily in that, at all intermediate 
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steps, conversion is made to pseudo-positional rather than 


positional notation. 


Algorithm M. (Multiplication of Non-negative Integers) 


M1. Compute the digit-by-digit products 


{ x(i) « y(j). | 1Si<n, Ts 4jsn0 yy. 
M2. Convert the n partial products 
pO) = 2 x) # yC) # bmi, 1 <5 <0 
to pseudo-positional notation, 
p' (i,j) * b™1, 0 < pt(i,4) < 2b-2. 


5 
a 1 = 
pedi = 2 


M3. Compute the pseudo-sum w = p' (jj) « b2-Jd. 


tar 
fh 


M&. Convert w to positional notation. 


3 The product of an m-digit multiplicand 
and an n-digit multiplier can be computed in 1 parallel 
multiplication, 3rlog nyt2 parallel additions and 
2rlog(mtn)y parallel Boolean operations on a SIMD parallel 
computer with mn processors. Furthermore, the parallel 


operation bound is asymptotically optimal. 


Proof. It is obvious that step M1 can be computed in 1 


parallel multiplication with mn processors. 
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In step M2, 0 < x(i) * y(5) < (b-1)2 < b(b-1) implies 
that all carries are bounded by b-2. Therefore, after one 
round of carry releasing, all digits are bounded by (b- 
1)#(b-2) = 2b-3 < 2b-2 (i.e., after one round of carry 
releasing, the result is in pseudo-positional notation). 
Thus, step M2 can be computed in 1 parallel addition with an 


processors. 


In step M3, using a generalized log-sum algorithm, the 
pseudo-sum w can be computed in -log ny pseudo-additions. By 
Lemma 4.2.2, each pseudo-addition can be computed in at most 
3 parallel additions. Thus, step M3 can be computed in a 
total of 3rlog n, parallel additions. Since each pseudo- 
positional integer has at most mtn digits and at most tn/23 
pseudo-sums are formed at any time, a total of (mtn)tn/2! < 


mn (if m 2 n) processors is sufficient for computing w. 


Finally, the conversion of w to positional notation in 
step M4, by Lemma 4.1.3, can be computed in 1 parallel 
addition and 2,log(mtn), parallel Boolean operations (mtn 


procesSSOrs). 


The total operation count is therefore 1 parallel 
multiplication, 3rlog nyt2 parallel additions and 


2rlog (mtn), parallel Boolean operations with mn processors. 


The asymptotic optimality is obvious, Since the product 


depends on all mtn inputs. Q.E.D. 
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Let a and c be any (m+n)-digit and n-digit positive 


integers, respectively. We wish to find a quotient q and a 


S8=29 “°C -t er. 20 Sracuc. 


AS usual, once a fast parallel multiplication algorithm 
is given, it can be used for the design of a fast parallel 
division algorithm as follows: First, it suffices to compute 
the quotient gq only, since then r can be obtained by means 
of r =a- q * c. Second, let 1/c denote the "reciprocal" of 
ce Then finding gq = a «* (l/c) can be reduced to 
multiplication and ceciprocal approximation [Knuth 1969, Aho 
et al 1974, Borodin and Munro 1975]. The Newton iteration 
formula, 

yo pemed(y, = Wya)zcy 


for example, can be used to compute the reciprocal of c. 


1 The division of an (m+n)-digit integer by 
an n-digit integetc can be computed in O(log?(mt+1)) + 
O(log(mtn)) parallel operations on a SIMD parallel computer 


with (m+1) (m+n) processors. 


Proof. The above Newton iteration formula entails two 
(multiple-precision integer) multiplications and one 
subtraction. In this case, the number of parallel operations 
g(2k=m+1) required to compute the reciprocal satisfies, at 


the i-th stage, 
gT(2ity = 1(24-1) + O(4)- 
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where 
T(29)) =. 13, 
Thus, T(2k) = o(k2), and the (m+1)-digit reciprocal can 


therefore be computed in O(log? (m+1)) parallel operations. 


The number of processors used at the i-th Stage is 2in. 
Thus, a total of (m+t1jn << (m#1) (mtn) processors is 


sufficient for computing the reciprocal. 


Next, by Theorem 4.2.3, the quotient q = a * (1/c) can 
be computed in O(log (mtn)) parallel cperations with 


(m+1) (m+n) processors. 


Finally, the remainder r = a - g « c can obviously be 
computed in O(log(mtn)) parallel operations with (mt1) (mtn) 


processors. 0. Bade 


1-digit integer can be computed in O(log?m) parallel 


operations on a SIMD parallel computer with m? processors. 


The parallel integer division algorithm given above is 
slower than the parallel integer multiplicaticn algorithm. 
This is in contrast with our experience on Sequential 
computers, where integer division and multiplication are of 
the same complexity. However, the above iterative algorithm 


is not the fastest possible parallel algorithm. 


an n-digit integer can be computed in O(log(mtn)) parallel 


operations on a SIMD parallel computer with bmtt(mt1)n 
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processors. 


integer by an a-digit integer results an  (mt1)-digit 


quotient and an n-digit remainder. 


For all integers x, 0 < x < bmi, i.e., x has at most 


m+1 digits, do the following: 


1. Compute x * c. 
Since we are concerned only with integers which are 
candidates for the quotient, we compute only the least mtn 
Significant digits of x * c and discard all x such that 
x «x Cc exceedS mtn digits. This, by Theorem 4.2.3, can 
certainly be done in O(log(mtn)) parallel operations 


(bm+t(m+1)n processors). 


2. Compute a - X * Cc. 
This can certainly be done in 0 (log (m+n) ) parallel 


Operations (b™+!(mtn) processors). 


Scslkt O¢< a —“x «co < cithen ¢ = "x and co —Tal= x ice 
It is easily seen that comparison of n-digit integers can be 
reduced to subtraction of n-digit integers (bmtin 


processors). 


The existence and uniqueness of g and r are guaranteed 


by the Integer Division Algorithn. Q.E.D. 


1-digit integer can be computed in O(log m) parallel 


operations on a SIMD parallel computer with b™m processors. 
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It may appear therefore that on parallel computers the 
question regarding the parallel complexity of integer 
division has finally been resolved. The above parallel 
algorithm requires O(log(mtn)) parallel operations and a 
fan-in argument can be used to show that at least log(mtn) 
parallel operations are required. The parallel algorithn, 
however, functions with a strange pecularity. In the final 
step (step 3 in the proof of Theorem 4.3.3), the parallel 
algorithm uses unrestricted fan-in to select one special 
processor from indefinitely many. This capability may be 
viewed as being a multiple operand operation and fan-in 
arguments are therefore no longer applicable. The lower 
bound of log(mtn) parallel operations for division on such a 
computational model is therefore no longer valid. However, 
the above integer division algorithm still serves a useful 
purpose. On parallel computers such as STARAN, the 
capability of determining the location of the first active 
processor is available. This parallel algorithn might 
therefore be an efficient one to implement on such parallel 


computers at least for sufficiently small problems. 
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There are many applications in which it is much more 
efficient to do arithmetic in "modular representation". 
Examples of such applications include multiple-precision 
integer arithmetic [Schoenhage & Strassen 1971, Chapter a], 
integer and polynomial linear systems [Young & Gregory 1973, 
McClellan 1973], polynomial GCDs [Brown 1971, Collins 1971, 
Moses & Yun 1973] and polynomial factorizations [ Berlekamp 


1968, Musser 1975, Wang & Rothschild 1973]. 


The advantages of modular representation are that 
addition, subtraction and multiplication are very simple. In 
addition, on parallel computers computations corresponding 
to different moduli can all be done at the same time. The 
same kind of efficiency can not be achieved by the usual 
arithmetic discussed in Chapter 4, since carry propagation 


must be considered. 


The general scheme of a modular method consists of 


three steps and is outlined as follows: 


1) Convert all the input arguments into their modular 


representation. 


2) Perform the required computations in the modulo 


classes. 


3) Convert the results back to the ordinary 


representation. 
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Thus the use of modular arithmetic can be justified only if 
efficient algorithms for conversion into and out of the 


modular representation are available. 


In Section Dien we consider polynomial modular 
transforms. The forward polynomial transform (i.e., the 
evaluation of polynomials of degree < n-1 at n points) can 
be done in -log ny parallel additions and rlog ny parallel 
multiplications and this parallel operation bound is 
Optimal. The inverse polynomial transform (i.e, the 
polynomial interpolation at n points) can be done in rlog(n- 
1jatelog ny+1 parallel additions and -log(n-1) +3 parallel 
multiplications / divisions and this parallel operation 
bound is asymptotically optimal. Both parallel algorithms 


assume polynomially bounded parallelisn. 


In Section 5.2 we consider integer modular transforms. 
Efficient parallel algorithms are given to find then 
residues of an m-digit integer and the inverse problem (i.e. 
the Chinese remainder problem) in O(log n) parallel 
Operations. Again, as in the division of integers, a more 
powerful computational model is used. These parallel 
algorithms are distinguished by the fact that all assume 


exponentially bounded parallelism. 


In Section 5.3 we consider the exact solution of linear 
systems. By using modular methods and the results obtained 
in Section 5.2, full linear systems of order n can be solved 
exactly in O(log n) parallel operations with exponentially 


bounded parallelisn. Consequently, multiplication of 
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matrices and solution of linear systems are of the same 


parallel complexity. 


In Section 5.4, by using results of Section 5.3, an 
efficient parallel algorithm is given for the division of 
polynomials. It will be shown that the quotient and the 
remainder of a polynomial of degree mtn divided by a 
polynomial of degree n can be computed in 0(19g (m+1)) 
parallel operations with exponentially bounded parallelisn. 
Consequently, polynomial multiplication and division are of 


the same parallel complexity. 
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Given any polynomial f(x) and any monic linear 
polynomial x-a, by the Polynomial Division Algorithn, there 
exist unique polynomials q(x) and r(x) such that 

f(x) = q(x) (x-a) # r(x), deg(r(x)) < deg(x-a) =1. 
The remainder r(x) is called the residue of f£ modulo the 
modulus x-a. Note that the degree of r(x) is less than 1. 
Therefore the polynomial r(x) is a constant. Substituting 
the constant a into both sides of the above equality, it 
follows immediately that 

C(x) = f(a). 
Thus finding the residue of a polynomial modulo a monic 


linear polynomial is equivalent to evaluating a polynomial. 


Furthermore, the process is reversible. 


Interpolation Theorem. Given nm residues { cr. | 


1<i<n } corcesponding to n moduli { x-a; [ 1Si<n }, 


there exists a unique polynomial f(x) of at most degree 


n-1 such that 


IA 
rw) 
1A 
=] 
e 


f(a.) = Cie 1 


Thus, any polynomial f of at most degree n-1 can be 
represented uniquely by a set of ao residues. The set of 


residues {c, | 1Sis<n } is called the modular representation 
il, 


of f modulo f{ x-a; | 1Sis<n }. The process of finding the 


modular representation for 4 polynomial is called the 


The classical sequential algorithm for evaluating 
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polynomials at many points uses Horner's rule and requires a 
total of 2n? operations. Fast sequential algorithms for the 
Same problem perform successive divisions by supermoduli 
(Cf. Chapter 3) [Fiduccia 1972, Moenck & Borodin 1972] and 
require a total of o(n log2n) operations. On parallel 
computers, it turns out that straightforward evaluations are 


optimal. 


Theorem 5.1.1 The n residues of a polynomial f(x) of 
degree n-1 can be computed in -log ny parallel additions and 
rlog ny parallel multiplications on a SIMD parallel computer 
with n2 processors. Furthermore, the parallel operation 


bound is optimal. 


Proof. Since the computation of each residue is 
independent of the other, the n residues can be evaluated 
Simultaneously. Thus, the parallel operation bound and its 
optimality follow from Theorem 3.1.1 On evaluation of 
polynomials. A total of n2 processors is sufficient because 


evaluation of f at each point uses only n processors. Q.E.D. 


The process of recovering the polynomial from its 
modular representation, i-e. exact interpolation, is called 
the inverse polynomial modular transform. The classical 
sequential algorithms for polynomial interpolation normally 
use the Lagrangian or Newtonian interpolation formula and 
require O(n2) operations (Lipson 1971]. Fast sequential 
algorithms reduce polynomial interpolation to evaluation and 


multiplication and require O(n log?n) operations [Horowitz 


1974, Moenck 6& Borodin 1972]. On parallel computers 
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polynomial interpolation can be solved efficiently by means 


of the Lagrangian interpolation formula: 


nN 


n 
£(x) = a Cy fe ( (x-a,;) /(a,-a;) je 


I1. Compute O-a,. a,-aie 3 # k and 1 < i <n, 
Sako ts 
I2. Compute A <= (a, —ay)77 tos KS. a. 
k a 


I3. Compute B, <- cif Ae ieSe Kee? on 


I4. Compute C, (x) <- i (x#(-a)), 1S ken. 
i7k L 


I5. Compute D. (x) <- BC, (x) ¢ 1< k <n. 


lh 
I6. Compute 2 D, (x) - 


Theorem 5.1.2 Polynomial interpolation at n points can 
be computed in -log(n-1)ytelog nyt2 parallel additions / 
subtractions and 2rlog(n-1)1+4 parallel multiplications / 
divisions on a SIMD parallel computer with n% processors. 


Furthermore, the parallel operation bound is asymptotically 


optimal. 


Proof. Steps I1 - I3 can obviously be computed in 1 


parallel subtraction (n? processors), rlog(n-1), parallel 
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multiplications (nt (n-1)/24 processors) and 1 parallel 


division (n processors), respectively. 


Step I4, by Theorem 3.4.2 on elementary symmetric 
functions, can de computed in -flog(n-1)4+1 parallel 
additions, flog(n-1),+1 parallel multiplications and 1 
parallel division. Since the computation of each C,.(x) uses 


n@ processors, a total of n3 processors is sufficient. 


Step I5, by Theorem 3.2.1, can be computed in 1 
parallel multiplication. Since the computation of each Dy (x) 
useS n processors (note that each D, (x) has degree n-1), a 


total of n2? processors is sufficient. 


Step I6 can be obtained by computing the sums of the 
corresponding coefficients of all D. (xp ts which in turn are 
obtained by using the log-sum algorithm. Therefore, this 
step can be computed in -log nq parallel additions (ntn/2!4 


processors). 


On summing the parallel operations, the parallel 
operation bound follows. A trivial lower bound (obtained, 
for example, by the fan-in argument) for this problem is 


rlog ny. Therefole, the parallel operation bound is 


asymptotically optimal. Q.E.D. 


The parallel algorithm given above faithfully reflects 
the Lagrangian interpolation formula. But on a close 
examination, one immediately observes that the computations 
in steps I1 - I3 are independent of the computations in step 


Ia. A careful cearrangement of the parallel algorithn, 
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requiring steps I1 - I3 to be meshed with step I4, yields a 


slightly faster parallel algorithm. 


Theorem 5.1.2" The exact polynomial interpolation at n 
points can be computed in -log(n-1)4+rlog ny+1 parallel 
additions and rlog(n-1),+3 parallel multiplications / 


divisions on a SIMD parallel computer with n? processors. 
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Modular Transforms 


The previous discussion about polynomial modular 
arithmetic has a_ strong analogy in the integer setting. 
Given any integer a and any integer c > 0, by the Integer 
Division Algorithm, there exist unique integers q and r such 
that 

an=""q) * Ci #5, 0-52 6. Cc. 
The Cemainder c is called the residue of a, modulo the 


modulus c, and denoted by r = a mod c. 


The analogue of the Interpolation Theorem in the 


integer case is the Chinese Remainder Theoren: 


Chinese Remainder Theorem. Given n residues {cr | 
1<i<n } corresponding to n mutually relatively prime 
moduli { a. | 1<i<n }, there exists a unique integer u, 


0O<u <M = M,;Moee-Mie¢ such that 


cr. = u mod Mie 5 en RS 


Thus, any integer can be represented uniquely by a 
sufficiently large set of residues. The set of residues { ©, 


{| 1s<i<n } is called the modular representation of u modulo 


By analogy with the Lagrangian interpolation formula, 


the Lagrangian Chinese remainder formula can be written as: 


n 
u = y C1a1,D mod M, 
eit el: 
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Although the Chinese Remainder Theoren does not explicitly 
require the moduli to be prime, nocmally they are so chosen 
since in practice multiplicative inverses are required at 
intermediate steps. In addition, for the sake of efficiency, 
the primes are chosen to be large 1-digit (i.e., Single 


precision) integers. 


The process of finding the modular representation for 
an integer is called the forward integer modular transform 


and its recovery fron the residues the inverse integer 


modular transforn. 


It is obvious that forward integer modular transform 
requires no moce parallel operations than does integer 
division. Conseguently, as for division, we give two 
parallel algorithms, one assuming polynomially bounded 


parallelism and the other exponentially bounded. 


1 The n residues of an integer with at most 
m digits modulo n 1-digit moduli can be computed in 0(log@n) 
parallel operations on a SIMD parallel computer With mn 


processors. 


Proof. Since, by Corollary 4.3.2, each residue can be 
computed in O(log2m) parallel operations on a SIMD parallel 


computer with m2 processors, the n residues can be computed 
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in O(log2m) parallel Operations on a SIMD parallel computer 


with m2n processors. Owais 


Theorem 5.2.2 The n residues of an integer with at most 
m digits modulo n 1-digit moduli can be computed in O(log mn) 
parallel operations on a SIMD parallel computer with b™an 


processors. 


Proof. Since, by Corollary 4.3.4, each residue can be 
computed in O(log m) parallel operations on a SIMD parallel 
computer with b™n processors, the n residues can be computed 
in O(log m) parallel operations on a SIMD parallel computer 


with b™mn processors. On neds 


Next, we show that inverse integer modular transforn 
requires no more parallel operations than does integer 
division. We give two parallel algorithms, again, one 
assuming polynomially bounded parallelism and the other 


exponentially bounded. First we prove the following lemma. 


Lemma 5.22.3 The product of n 1-digit integers can be 
computed in O(log2n) parallel operations on a SIMD parallel 


computer with n2 processors. 


of 2 factors, then products of 4 factors, and so on. At the 
Cerra face when forming terms with 2k factors (i.e., when 
taking products of terms with 2k-1 factors), log (2k-1+2k-1) 
= k parallel operations are required. Thus, a total of 

4 ¢ 2-4 cee # plog my = ‘flog ny*/2 © clog n7/2 


parallel operations is required. 
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In addition, at the k-th Stage, there are tn/2 3 
factors each using 2k-1%2k-1 = 22k-2 processors to compute. 


Thus, a total of nx2k-2 < n27y processors is sufficient. 


Q.E.D. 


Theorem 5.2.4 Given n residues modulo n 1-digit moduli, 
the inverse integer modular transform can be computed in 


O(log2(n-1)) parallel operations on a SIMD parallel computer 


with n3 processors. 


Proof. Using the same notation given in the Lagrangian 
Chinese remainder formula above, each a, is a product of n-1 
1-digit integers. By Lemma 5.2.3, { a; tel =k=08} ecany abe 
computed in O(log2(n-1)) parallel operations ((n-1)2n 
processors). Since each a, has at most n-1 digits, { a, mod 
my | 1<k<n } can be computed in another O(log? (n-1)) 


parallel operations ((n-1)2n processors). 


In addition, since each modulus m, is prime, by 
Fermat's Theoren, DL can be obtained from 


b, = (a;)71 mod m, 


(a, mod m,)—* mod m, 


wk 
(a, mod m,) -2 mod an 


k kK 
Thus { b, { 1<k<n } can be computed from { ai mod ms | 
1<k<n } in at most rlog ( max { m,-2 {| 1$kSn } )q parallel 


multiplications (n processors). 


Finally, it is obvious that the products { r,a,b, | 
1<k<n } and the Summation of them can be computed in 


O(log n) parallel operations (2(n-1)N processors) .- Since the 
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Sum has at most atlog n digits, its residue taken modulo M = 
M)M,e+-M can be computed in at most 0(log2(n-1)) parallel 


Operations (0(n2) processors). 0O.E.D. 


Theorem 5.2.5 The inverse integer modular transform of 
Theoren 5.254 can be computed in O(log n) parallel 


Operations on a SIMD computer with b2nn2 processors. 


Proof. For all integer x, 0 < x < Mm = M Mo+--M ¢ do the 


following: 


1. Define n residue functions, 
u.(x) = x modm, 1s isn. 
al al 
By Corollary 4.3.4, this can be computed in O(log nj) 


parallel operations (Mb"n2 processors}. 


2. Define n Boolean functions, 


b. (x) Bal Aff u, (x) = C.. 1. Sess 
This can be computed in 1 parallel (single-precision) 


comparison (Mn processors). 


3. Compute the Boolean product, 
c(x) = b, e b, (x) BB wee ® b (x). 
This can be computed in flog ny, parallel ANDs (M&n/23 


proceSSOrs) - 
4. If c(x) = 1 then u = x; 


The existence and uniqueness of u is guaranteed by the 


Chinese Remainder Theorem. 


Summing up all the parallel operations we have O(log n) 
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total parallel operations. Since M is bounded bya tae 


processor bound follows. Q.E.D. 


=e ) ¢ i ; a +a | 
' 7 ~~ : > 
_ eo apis! enéponeeao 
a, 7 " aenh x 7 
eis “a Ye pba BE ’ i- i 


7 
PY RY toe ; - 


76 


5-3 Exact Solution of Linear Systems 


Full linear systems of order n can be solved on 
parallel computers in O(n) parallel operations by Gauss- 
Jordan elimination [ Borodin & Munro 1075), LDU 
factorization, oc Givens reduction (Sameh & Kuck 1975}. In 
addition, certain structured sparse linear systems can be 
solved much more efficiently, e.g., triangular linear 
Systems can be solved in O(log2n) parallel operations 
{Heller 1974, Hyafil & Kung 1974, Chen & Kuck 1975, Orcutt 
1974], and tridiagonal linear systems can be solved in 


O(log n) parallel operations [Buneman 1969, Stone 1973]. 


Despite these facts, researchers present no 
Significantly faster parallel algorithm than the Gauss- 
Jordan elimination for the solution of general linear 
systems [Pease 1969, Sameh & Kuck 1975, Csanky 1974]. 
(Recently, Csanky presents a parallel algoritha which 


requires O(log2n) parallel operations.) 


In this section we present an O(log n) parallel 
algorithm for solving full linear systems. Thus, as is the 
case for sequential computers [Borodin & Munro 1975], 
multiplication and inversion of matrices are of the same 
parallel complexity, both requiring O(log n) parallel 
operations. The parallel algorithn given below is a modular 
method which gives exact solution of linear systems. Using 
modular method, all computations performed, except the 
initial and final conversions, are modular arithmetic: 


1) (atb) mod p = (a mod p + b mod p) mod p. 
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2) (axb) mod p = ( (a mod p) « (b mod Pp) ] mod p. 
3) If a # 0 moda Pe then a-! exists and 


a~* mod p = (a mod p)P-2 mod p. 


in doing modular arithmetic, it is not clear how many 
basic machine instructions are needed to implement one 
modular operation, since this is a highly machine dependent 
feature. But, for any particular computer, the total number 
of machine instructions executed must be proportional to the 
number of modular operations performed. Hence, in the rest 
of this chapter, all operations indicated are implicitly 


assumed to be modular operations. 


heorem 5.3.1 Given any n by n matrix A in GF(p), the 
inverse of A mod p can be computed in 1 parallel 
multiplication, ,log ny parallel additions and 2,rlog ny 
parallel ANDS on a SIMD parallel computer with pUn2 


processors. 


Proof. Let V(n) be the set of all n-vectors with 


components in GF(p). Obviously, |V(n) | p®. For each a- 


vector x in V(n), do the following: 


i 


1. Compute a vector function v(x) xX *« A. 
By Corollary 2.1.4, this can be computed in 1 parallel 
multiplication and -log ny parallel additions. Since each 


vector-matrix product uses n? processors to compute, a total 


of p®=n2 processors is sufficient. 


2. Define n vector Boolean functions, 


by = v@y = it) TSK Sn, 


a i fon O* 
a 

7 Te aie 
29 Row fe a! 
bans 


a 
7 
7 , 
nag wed Shei thalBE SE: yoroeqss Sze Qekubee pose 
' LAI 
is =ta04 of bh beSan AIH sno koamairaad A hax 


btone yinvit # et addd editfe maanvens j 


> 


> / <<¢ - ; 
‘eda: at*+ \Fatugsze9 IS LGILIISS yi. 307 oe 


wr fenne ta j ; : fan: he*unoerxy® 2a0r2 ‘reagan sotsone § 
ij el sStyscq wae 83617906 1b nom. 7 “aa 


ta 
LC 


’ = _ 
fortqet = son thok >) waa lta tego Leeann ® 6 ¢ 
eo itszece Gphebhew ad oe 7 naa 


_ 
a 7 7 ry 
pi jutgeos ef aad... .tog> As to |saaew 
+ ee 


++ h Re ro j f4 LAY 5h, a ¢, ky « a 
Ley deeds: tetisseg ghE2 oh oa adit (f os 


- | oe Th 907 


- gh x 


1.8; atees { sa o- 386 added (LY tged 


: on ad 
“1 GEE toe 4 = t(rpt. opfenoivdd « {ypea at ecneae t 


foe ge = Ig} zone E358: ef oe 
i _ : 7 
Iefisuag | hesugacs Md AE aide yeahs FOM 


: Fon wt 
fe. ese sanpselihe Lelierng ere ae vhiese 


aimed? od dommano sean duheay 244%ee: 
| eA 


8 wt 


78 


where i(k) is the k-th row vector of the identity matrix I 
(of order n). These functions can be obtained in 1 parallel 


comparison (p"n2 processors) . 


3. Define n scalar Boolean functions, 


Cc, (x) = the Boolean product of the Boolean vector 
be(x}701, Sk sn. 
For each k and each x, the Boolean product can be computed 
in flog ny, parallel ANDS (tn/24 processors) by using a 
Boolean version of the log-product algorithm. A total of 
p ntn/24 processors is sufficient for computing these 


functions. 


4. If C,.(x) = 1 then x(k) = x and d(k) = 1, 1¢ k <n, 
where x(k) constitutes the k-th row vector of a matrix X and 
d = (d(1), d(2), ---, d(m)) is a Boolean vector (initially 


all of which are false). 


5. Compute e = the Boolean product of the Boolean 


vector d. 


This can certainly be computed in -log ny parallel ANDs 


{tn/24 processors). 


6. If e= 1 then the inverse of A exists and A~! = X. 


Otherwise, the inverse of A does not exist. 


Counting all the parallel Operations and processors, 


the parallel operation and processor bounds follows. Q.E.D. 


The above pacallel algorithm is restricted to finite 


fields and is due to Csanky {1974]. We use it below to 
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design an efficient paraliel algorithm over the rational 


field. 


For any n by n matrix A, the following notations are 


adopted: 
a (i, j) the (i,j) entry of A 
A(i,j) the cofactor of a(i, 7) 
Adj (A) the adjoint of A 
det (A) the determinant of A. 


The following identity is well known: 
A x Adj(A) = det(A) « I (1a) 


or A~! = Adj(A) / det(A). (1b) 


We now give an efficient parallel algorithm for 
computing the determinant of A. First, define n submatrices 
of A as follows: 

AC1) =A 

AC2)? = the minor of aCt)(1,1), i.-e., the submatrix 
obtained from A by deleting the first row and the first 
column. 

ack#1) = the minor of a€k)(1,1), i.e., the 
submatrix obtained from A by deleting the first k rows and 
the first k columns. 


Obviously, 
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and ACk) (1,1) = det (ACKtt)), 1 


7 Cet: BCKI = (ACKI}=2,) 1 Sk S7n-1. Ie all 
the { Bek» 4 1sksn-1 } are known then the determinant of 


(A mod p) can be computed in rlog(n-1), additional parallel 
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multiplications and 1 parallel division. 


Proof. Express (1b) in terms of the (1,1) entries, 


det (A) = AC19 (1,1) 7 bC190 (1,1) 


det (AC2)) 7 bC19(1,1). 
Thus, the identity (1b) can be applied again and again 

det (A) = AC2) (1,1) / bC1) (1,1) eb€2)(1,1) = ee 

= a(n,n) / bO1D (1,1) eb62)9 (1,1) % ...ebC-19 (1,1), 
where each b€kK)(1,1) is the (1,1) entry of BCkK. This 
expression ¢27) can obviously be computed in rlog(n-1) 4 


parallel multiplicaitons and 1 parallel division. QoeaDs 


Once we have det(A) the solution of the linear system 
Aux = b 
can be found immediately from: 


x = A“! b= (AAU(A) & DY) 7 det (A) 


y / det(A), 
where y = Adj({A) * be Thus, an efficient solution of the 


linear system relies on an efficient solution of y. 


Lemma 5-3-3 Let z = (A mod p)~* * ( b mod p). Then ( y 


mod p) = (det(A) mod p) * Z. Hence y mod p can be computed 


from z in 1 parallel multiplication. 


Proof. The first part can be found in {Young & Gregory 
1973]. The second part is a trivial consequence of the first 


part. Q.E.D. 
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(27) This expression is true in any field, not just in 
GF (p) - 
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The last remaining problem in a modular method is the 
number of prime moduli needed to guarantee the unigue 


representation of det(A) and Ye 


Let a(i) be the i-th row vector of A, then we can 
choose the moduli { P; | 1Si<m } such that [Young & Gregory 
1973 ] 

Py Poee-P, 2 2 (haCMileiia(2)ii«x...efla(n) ti*ibl, 
where ({-{{ is the 2-norm (i.e., Euclidean norm) and |.{ is 
the 1-norm. In parallel computation the exact number of 
moduli needed is not as crucial here as in the sequential 
case, because computations corresponding to each modulus can 
be performed simultaneously. The important thing here is 
that the precision of the above bound is roughly 
proportional to an, the order of A. Thus, by Theorem 5.2.5, 
the number of parallel operations required to restore det (A) 


and y is proportional to O(log n). 
We now can proceed to give the complete algorithm: 


Algorithm L. (Exact Solution of Linear Systems) 


For each modulus p, do the following: 
L1. Find (A mod p) and ( b mod p). (28) 
L2. Compute BCk) = { (A mod p)¢k)}-1, 1 < k < n-1. 
L3. Compute det(A) mod p.- 
L4. Compute z = (A mod p)-* * ( b mod p). 


L5. Compute y mod p = (det (A) mod p) x Z. 
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L6. Recover det (A) and y. 


The solution is the quotient of y and det(A), and the 


actual division may or may not necessarily be carried out. 


Theorem 5.3.4 Full linear system of order n can be 
solved exactly in O(log nj) parallel Operations on a_e SIMD 


parallel computer with exponentially bounded parallelisn. 


Proof. The only multiple-precision arithmetic performed 
are in step £1 and step L6. Step L1 can be computed in a 
constant number of parallel operations. Step L6 can be 
computed in O(log n) parallel operations as commented 


previously. 


Steps L2 - L4& each can be computed in O(log n) parallel 
operations by Theorem 5.3.1, Theorem 5.3.2 and Corollary 
2.1.4, respectively. Step L5 can be computed in 1 parallel 


multiplication by Lemma 5.3.3. 


Thus, the complete algorithm can be computed in 
O(log n) parallel operations. Both step L2 and step L6 


assume exponentially bounded parallelisn. Q.E.D. 


Theorem 5.3.5 Matrix multiplication, matrix inversion, 


determinant calculation and solution of linear systems are 


of the same parallel complexity. 


(28) The fact that A and b must be integral is no serious 
restriction due to the fact that if the entries of A and the 
components of b are rational numbers, they can be converted 


to integers simply by scaling. 
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Proof. It is known that inversion of matrices, solution 
of linear systems and calculation of determinants are all of 
the same parallel complexity (Csanky 1974, Borodin and Munro 
1975]. By Theorem 5.3.4 and Corollary 2.1.4, the solution of 
linear systems and matrix multiplication are of the same 
parallel complexity; namely, O(log n). Therefore, all three 
of the above problems and matrix multiplication are of the 


same parallel complexity. Q.E.D. 
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Let f and g be any two polynomials of degree mtn and n, 
respectively, and gq and rc be the quotient and remainder, 
respectively, of f divided by g. By using the results 
obtained in the last section, a very efficient parallel 


algorithm for computing g and r can be derived. 


2i The quotient gq and the remainder r of f 


divided by g can be computed in O(log(m+1)) parallel 
operations on a SIMD parallel computer assuming 


exponentially bounded parallelism. (29) 


{ro 


roof. In Section 3.3, a linear recurrence equation is 
derived for the coefficients of gq. As is well-known a linear 
recurrence equation can be viewed as being a triangular 
linear system: 
Gq=t, 

where g and f£ are the (mt+1)-vectors formed by the 
coefficients of g and the first m+1 coefficients of f, 
respectively, and G is a given matrix of order m+1. Then, by 
Theorem 5.3.4, gq (consequently the polynomial q) can be 
solved in O(log (m+1) ) parallel operations (assuming 


exponentially bounded parallelism). 


Next, g * q can be computed in 1 parallel 


©€29) In Corollary 3.3.2 a parallel algorithm requiring 
0 (log (m+1) log (n+1)) parallel operations but assuming 
polynomially bounded parallelism is given. 
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multiplication and no more than rlog (m+1) 4 parallel 


additions. 


Finally, the remainder, cr = f - g * gq, can be computed 


in 1 additional parallel subtraction. Q.E.D. 


Corollary 5.4.2 Polynomial multiplication and division 


are of the same parallel complexity. 
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The main objective of this thesis was to find the best 
possible parallel algorithms (in the sense defined in 
Chapter 1) for integer arithmetic, polynomial arithmetic and 
modular arithmetic. It turns out that there exists very 
efficient parallel algorithms for all problems studied. 


Thus, the main objective of this thesis has been achieved. 


We observe, however, that in most cases there is a 
tradeoff between the number of parallel operations performed 
and the number of processors required. If the level of 


patallelism of a parallel algorithm (i.e., the number of 
processors required by the parallel algorithm) is used to 
classify parallel algorithms, then among all best possible 
parallel algorithms studied in this thesis there are clearly 
three broad classes: 

1) parallel algorithms assuming linearly bounded 
parallelism, e.g-, polynomial addition and integer 
addition. 

2) parallel algorithms assuming polynomially bounded 
parallelism, e€.gJ-, polynomial multiplication and 
integer multiplication. 

3) parallel algorithms assuming exponentially bounded 
parallelism, e€-]J-, polynomial division and integer 
division. 


The three classes of parallel algorithms can be considered 
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increasingly more difficult if level of parallelism is 


adopted as the measure of parallel complexity. 


For 


future research the following problems are 


therefore suggested: 


1) 


2) 


3) 


Design (asymptotically) optimal parallel algorithms 
assuming polynonially (or linearly) bounded 
parallelisn for polynomial division, integer 
division, the Chinese remainder problem and the 
solution of full linear systems. 

Prove parallel lower bounds under a priori 
polynomially (or linearly) bounded parallelism for 
various problems. 

Study tradeoffs between the number of parallel 
operations performed and the number of processors 
required for various problems (especially those 


listed in 1 above). 
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